CONCENTRATION AND REGULARIZATION OF RANDOM GRAPHS


CAN M. LE, ELIZAVETA LEVINA AND ROMAN VERSHYNIN 


Abstract. This paper studies how close random graphs typically are to their expectations.
We interpret this question through the concentration of the adjacency and Laplacian
matrices in the spectral norm. We study inhomogeneous Erdős-Rényi random graphs on $n$
vertices, where edges form independently and possibly with different probabilities $p_{ij}$. Sparse
random graphs whose expected degrees are $o(\log n)$ fail to concentrate; the obstruction is
caused by vertices with abnormally high and low degrees. We show that concentration can
be restored if we regularize the degrees of such vertices, and one can do this in various ways.
As an example, let us reweight or remove enough edges to make all degrees bounded above
by $O(d)$ where $d = \max n p_{ij}$. Then we show that the resulting adjacency matrix $A'$
concentrates with the optimal rate: $\|A' - \mathbb{E}A\| = O(\sqrt{d})$. Similarly, if we make all degrees bounded
below by $d$ by adding weight $d/n$ to all edges, then the resulting Laplacian concentrates with
the optimal rate: $\|\mathcal{L}(A') - \mathcal{L}(\mathbb{E}A')\| = O(1/\sqrt{d})$. Our approach is based on
Grothendieck-Pietsch factorization, using which we construct a new decomposition of random graphs. We
illustrate the concentration results with an application to the community detection problem
in the analysis of networks.


1. Introduction 

Many classical and modern results in probability theory, starting from the Law of Large 
Numbers, can be expressed as concentration of random objects about their expectations. The 
objects studied most are sums of independent random variables, martingales, nice functions 
on product probability spaces and metric measure spaces. For a panoramic exposition of
concentration phenomena in modern probability theory and related fields, the reader is referred
to the books [25, 9]. 

This paper studies concentration properties of random graphs. The first step of such study 
should be to decide how to interpret the statement that a random graph G concentrates near 
its expectation. To do this, it will be useful to look at the graph G through the lens of the 
matrices classically associated with G, namely the adjacency and Laplacian matrices. 

Let us first build the theory for the adjacency matrix A; the Laplacian will be discussed 
in Section 1.5. We may say that G concentrates about its expectation if A is close to its 
expectation EA in some natural matrix norm; we interpret the expectation of G as the 
weighted graph with adjacency matrix EA. Various matrix norms could be of interest here. 
In this paper, we study concentration in the spectral norm. This automatically gives us 
a tight control of all eigenvalues and eigenvectors, according to Weyl’s and Davis-Kahan 
perturbation inequalities (see [5, Sections III.2 and VII.3]). 

Concentration of random graphs interpreted this way, and also of general random matrices, 
has been studied in several communities, in particular in random matrix theory, combinatorics 
and network science. 

Date : August 10, 2016. 

E. L. is partially supported by NSF grants DMS-1159005 and DMS-1521551. R. V. is partially supported 
by NSF grant 1265782 and U.S. Air Force grant FA9550-14-1-0009. This work was done while C. L. was a 
Ph.D. student at the University of Michigan.




We will study random graphs generated from an inhomogeneous Erdős-Rényi model $G(n, (p_{ij}))$,
where edges are formed independently with given probabilities $p_{ij}$, see [7]. This is a
generalization of the classical Erdős-Rényi model $G(n,p)$ where all edge probabilities $p_{ij}$ equal
$p$. Many popular graph models arise as special cases of $G(n, (p_{ij}))$, such as the stochastic
block model, a benchmark model in the analysis of networks [22] discussed in Section 1.7,
and random subgraphs of given graphs.

Often, the question of interest is estimating some features of the probability matrix $(p_{ij})$
from random graphs drawn from $G(n, (p_{ij}))$. Concentration of the adjacency matrix and the
Laplacian matrix around their expectations, when it holds, guarantees that such features can be
recovered. As an example of this use of our concentration results, we will show that if $(p_{ij})$
has a block structure, the blocks can be accurately estimated from a single realization of
$G(n, (p_{ij}))$ even when the average vertex degree is bounded.

1.1. Dense graphs concentrate. The cleanest concentration results are available for the
classical Erdős-Rényi model $G(n,p)$ in the dense regime. In terms of the expected degree
$d = pn$, we have with high probability that

$$\|A - \mathbb{E}A\| = 2\sqrt{d}\,(1 + o(1)) \quad \text{if } d \gg \log^4 n, \tag{1.1}$$

see [16, 44, 28]. Since $\|\mathbb{E}A\| = d$, we see that the typical deviation here behaves like the
square root of the magnitude of expectation - just like in many other classical results of
probability theory. In other words, dense random graphs concentrate well.

The lower bound on density in (1.1) can be essentially relaxed all the way down to
$d = \Omega(\log n)$. Thus, with high probability we have

$$\|A - \mathbb{E}A\| = O(\sqrt{d}) \quad \text{if } d = \Omega(\log n). \tag{1.2}$$

This result was proved in [15] based on the method developed by J. Kahn and E. Szemerédi
[17]. More generally, (1.2) holds for any inhomogeneous Erdős-Rényi model $G(n, (p_{ij}))$ with
maximal expected degree $d = \max_i \sum_j p_{ij}$. This generalization can be deduced from a recent
result of S. Bandeira and R. van Handel [4, Corollary 3.6], while a weaker bound $O(\sqrt{d \log n})$
follows from concentration inequalities for sums of independent random matrices [35].
Alternatively, an argument in [15] can be used to prove (1.2) for a somewhat larger but still useful
value

$$d = \max_{ij} n p_{ij}, \tag{1.3}$$

see [27, 12]. The same can be obtained by using Seginer’s bound on random matrices [20].

As we will see shortly, our paper provides an alternative and completely different approach 
to general concentration results like (1.2). 

1.2. Sparse graphs do not concentrate. In the sparse regime, where the expected degree
$d$ is bounded, concentration breaks down. According to [24], a random graph from $G(n,p)$
satisfies with high probability that

$$\|A\| = (1 + o(1))\sqrt{d(A)} = (1 + o(1))\sqrt{\frac{\log n}{\log\log n}} \quad \text{if } d = O(1), \tag{1.4}$$

where $d(A)$ denotes the maximal degree of the graph (a random quantity). So in this regime
we have $\|A\| \gg \|\mathbb{E}A\| = d$, which shows that sparse random graphs do not concentrate.

What exactly makes the norm of $A$ abnormally large in the sparse regime? The answer is,
the vertices with too high degrees. In the dense case where $d \gg \log n$, all vertices typically
have approximately the same degrees $(1 + o(1))d$. This no longer happens in the sparser
regime $d \ll \log n$; the degrees do not cluster tightly about the same value anymore. There
are vertices with too high degrees; they are captured by the second equality in (1.4). Even
a single high-degree vertex can blow up the norm of the adjacency matrix. Indeed, since the
norm of $A$ is bounded below by the Euclidean norm of each of its rows, we have $\|A\| \ge \sqrt{d(A)}$.
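
For completeness, the row-norm lower bound can be spelled out in one line. This is only a sketch: $e_i$ denotes the $i$-th standard basis vector and $d_i$ the degree of vertex $i$ (notation introduced here for illustration), and the argument uses only that $A$ is symmetric with entries in $\{0,1\}$.

```latex
\[
\|A\| \;=\; \sup_{\|x\|_2 = 1} \|Ax\|_2 \;\ge\; \|A e_i\|_2
\;=\; \Big( \sum_{j=1}^n A_{ji}^2 \Big)^{1/2}
\;=\; \Big( \sum_{j=1}^n A_{ij} \Big)^{1/2}
\;=\; \sqrt{d_i} .
\]
```

Choosing $i$ to be a vertex of maximal degree gives $\|A\| \ge \sqrt{d(A)}$.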

1.3. Regularization enforces concentration. If high-degree vertices destroy concentration,
can we “tame” these vertices? One proposal would be to remove these vertices from the
graph altogether. U. Feige and E. Ofek [15] showed that this works for $G(n,p)$ - the removal
of the high degree vertices enforces concentration. Indeed, if we drop all vertices with degrees,
say, larger than $2d$, then the remaining part of the graph satisfies

$$\|A' - \mathbb{E}A'\| = O(\sqrt{d}) \tag{1.5}$$

with high probability, where $A'$ denotes the adjacency matrix of the new graph. The argument
in [15] is based on the method developed by J. Kahn and E. Szemerédi [17]. It extends to
the inhomogeneous Erdős-Rényi model $G(n, (p_{ij}))$ with $d$ defined in (1.3), see [27, 12]. As we
will see, our paper provides an alternative and completely different approach to such results.

Although the removal of high degree vertices solves the concentration problem, such a
solution is not ideal, since those vertices are in some sense the most important ones. In real-world
networks, the vertices with highest degrees are “hubs” that hold the network together. Their 
removal would cause the network to break down into disconnected components, which leads 
to a considerable loss of structural information. 

Would it be possible to regularize the graph in a more gentle way - instead of removing
the high-degree vertices, reduce the weights of their edges just enough to keep the degrees
bounded by $O(d)$? The main result of our paper states that this is true. Let us first state
this result informally; Theorem 2.1 provides a more general and formal statement.

Theorem 1.1 (Concentration of regularized adjacency matrices). Consider a random graph
from the inhomogeneous Erdős-Rényi model $G(n, (p_{ij}))$, and let $d = \max_{ij} n p_{ij}$. For all high
degree vertices of the graph (say, those with degrees larger than $2d$), reduce the weights of
the edges incident to them in an arbitrary way, but so that all degrees of the new (weighted)
graph become bounded by $2d$. Then, with high probability, the adjacency matrix $A'$ of the new
graph concentrates:

$$\|A' - \mathbb{E}A\| = O(\sqrt{d}).$$

Moreover, instead of requiring that the degrees become bounded by $2d$, we can require that the
$\ell_2$ norms of the rows of the new adjacency matrix become bounded by $\sqrt{2d}$.

1.4. Examples of graph regularization. The regularization procedure in Theorem 1.1 is
very flexible. Depending on how one chooses the weights, one can obtain as special cases
several results we summarized earlier, as well as some new ones.

1. Do not do anything to the graph. In the dense regime where $d = \Omega(\log n)$, all degrees are
already bounded by $2d$ with high probability. This means that the original graph satisfies
$\|A - \mathbb{E}A\| = O(\sqrt{d})$. Thus we recover the result of U. Feige and E. Ofek (1.2), which
states that dense random graphs concentrate well.

2. Remove all high degree vertices. If we remove all vertices with degrees larger than $2d$, we
recover another result of U. Feige and E. Ofek (1.5), which states that the removal of the
high degree vertices enforces concentration.

3. Remove just enough edges from high-degree vertices. Instead of removing the high-degree
vertices with all of their edges, we can remove just enough edges to make all degrees
bounded by $2d$. This milder regularization still produces the concentration bound (1.5).

4. Reduce the weight of edges proportionally to the excess of degrees. Instead of removing
edges, we can reduce the weight of the existing edges, a procedure which better preserves
the structure of the graph. For instance, we can assign weight $\sqrt{\lambda_i \lambda_j}$ to the edge between
vertices $i$ and $j$, choosing $\lambda_i := \min(2d/d_i, 1)$ where $d_i$ is the degree of vertex $i$. One can
check that this makes the $\ell_2$ norms of all rows of the adjacency matrix bounded by $\sqrt{2d}$
(the squared $\ell_2$ norm of row $i$ is $\lambda_i \sum_j \lambda_j A_{ij} \le \lambda_i d_i \le 2d$). By
Theorem 1.1, such a regularization procedure leads to the same concentration bound (1.5).
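
The reweighting in item 4 is easy to try numerically. The following is a minimal sketch, not the authors' code: the helper names `sample_graph` and `regularize_adjacency`, the homogeneous choice $p_{ij} = d/n$, and the values of `n` and `d` are assumptions made only for illustration.

```python
import numpy as np

def sample_graph(P, rng):
    """Draw a symmetric 0/1 adjacency matrix with independent edges and no self-loops."""
    upper = np.triu(rng.random(P.shape) < P, k=1)
    return (upper | upper.T).astype(float)

def regularize_adjacency(A, d):
    """Give edge (i, j) the weight sqrt(lam_i * lam_j), where lam_i = min(2d / deg_i, 1)."""
    deg = A.sum(axis=1)
    # Isolated vertices have no edges, so their lam value is irrelevant;
    # the maximum only avoids a division by zero.
    lam = np.minimum(2.0 * d / np.maximum(deg, 1.0), 1.0)
    s = np.sqrt(lam)
    return A * np.outer(s, s)

rng = np.random.default_rng(0)
n, d = 300, 3.0                      # illustrative sizes, with d = max_ij n * p_ij
P = np.full((n, n), d / n)           # assumed homogeneous edge probabilities
A = sample_graph(P, rng)
A_reg = regularize_adjacency(A, d)

# Euclidean norms of the rows of the reweighted matrix.
row_l2 = np.sqrt((A_reg ** 2).sum(axis=1))
print(row_l2.max(), np.sqrt(2 * d))
```

The printed maximum row norm should not exceed $\sqrt{2d}$, in line with the row-norm condition in the "moreover" part of Theorem 1.1.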


1.5. Concentration of Laplacian. So far, we have looked at random graphs through the
lens of their adjacency matrices. A different matrix that captures the geometry of a graph is
the (symmetric, normalized) Laplacian matrix, defined as

$$\mathcal{L}(A) = D^{-1/2}(D - A)D^{-1/2} = I - D^{-1/2} A D^{-1/2}. \tag{1.6}$$

Here $I$ is the identity matrix and $D = \mathrm{diag}(d_i)$ is the diagonal matrix with degrees
$d_i = \sum_{j=1}^n A_{ij}$ on the diagonal. The reader is referred to [13] for an introduction to graph
Laplacians and their role in spectral graph theory. Here we mention just two basic facts: the
spectrum of $\mathcal{L}(A)$ is a subset of $[0, 2]$, and the smallest eigenvalue is always zero.

Concentration of Laplacians of random graphs has been studied in [35, 11, 39, 23, 18]. Just
like the adjacency matrix, the Laplacian is known to concentrate in the dense regime where
$d = \Omega(\log n)$, and it fails to concentrate in the sparse regime. However, the obstructions
to concentration are opposite. For the adjacency matrices, as we mentioned, the trouble is
caused by high-degree vertices. For the Laplacian, the problem lies with low-degree vertices.
In particular, for $d = o(\log n)$ the graph is likely to have isolated vertices; they produce
multiple zero eigenvalues of $\mathcal{L}(A)$, which are easily seen to destroy the concentration.

In analogy to our discussion of adjacency matrices, we can try to regularize the graph to
“tame” the low-degree vertices in various ways, for example remove the low-degree vertices,
connect them to some other vertices, artificially increase the degrees $d_i$ in the definition (1.6)
of the Laplacian, and so on. Here we will focus on the following simple way of regularization
proposed in [3] and analyzed in [23, 18]. Choose $\tau > 0$ and add the same number $\tau/n$ to all
entries of the adjacency matrix $A$, thereby replacing it with

$$A_\tau := A + (\tau/n)\,\mathbf{1}\mathbf{1}^{\mathsf{T}}$$

in the definition (1.6) of the Laplacian. This regularization raises all degrees $d_i$ to $d_i + \tau$. If
we choose $\tau \sim d$, the regularized graph does not have low-degree vertices anymore.
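
As a small illustration of this regularization, the sketch below builds $\mathcal{L}(A_\tau)$ for a toy graph with one isolated vertex, where the unregularized formula (1.6) would divide by a zero degree. The function names, the toy graph, and the value of `tau` are assumptions made for the example only.

```python
import numpy as np

def laplacian(A):
    """Symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2}; all degrees must be positive."""
    deg = A.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(deg)
    return np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]

def regularized_laplacian(A, tau):
    """Laplacian of A_tau = A + (tau / n) * 1 1^T."""
    n = len(A)
    A_tau = A + tau / n          # broadcasting adds tau/n to every entry
    return laplacian(A_tau)

# Path on three vertices plus one isolated vertex (toy example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)
tau = 2.0
L_tau = regularized_laplacian(A, tau)
eigenvalues = np.linalg.eigvalsh(L_tau)   # ascending order
print(eigenvalues)
```

After regularization every degree equals $d_i + \tau > 0$, the spectrum stays inside $[0, 2]$, and the zero eigenvalue is simple because all entries of $A_\tau$ are positive.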

The following consequence of Theorem 1.1 shows that such regularization indeed forces
the Laplacian to concentrate. Here we state this result informally; Theorem 4.1 provides a more
formal statement.

Theorem 1.2 (Concentration of the regularized Laplacian). Consider a random graph from
the inhomogeneous Erdős-Rényi model $G(n, (p_{ij}))$, and let $d = \max_{ij} n p_{ij}$. Choose a number
$\tau \sim d$. Then, with high probability, the regularized Laplacian $\mathcal{L}(A_\tau)$ concentrates:

$$\|\mathcal{L}(A_\tau) - \mathcal{L}(\mathbb{E}A_\tau)\| = O\Big(\frac{1}{\sqrt{d}}\Big).$$

We will deduce this result from Theorem 1.1 in Section 4. Theorem 1.2 is an improvement
upon a bound in [18] that had an extra $\log d$ factor, and it was conjectured there that the
logarithmic factor is not needed. Theorem 1.2 confirms this conjecture.

1.6. A numerical experiment. To conclude our discussion of various ways to regularize
sparse graphs, let us illustrate the effect of regularization by a numerical experiment. Consider
an inhomogeneous Erdős-Rényi graph with $n = 1000$ vertices, 90% of which have expected
degrees 7 and 10% have expected degrees 35. We then regularize the graph by
reducing the weights of edges proportionally to the excess of degrees - just as we described in
Section 1.4 item 4, except that we use the overall average degree (approximately 10) instead
of $d$ (which results in a more severe weight reduction suitable for our illustration purpose).

Figure 1 shows the histogram of the spectrum of $A$ (left) and $A'$ (right). As we can
see, the high degree vertices lead to the long tails in the histogram of the eigenvalues, and
regularization shrinks these tails toward the bulk.

Figure 1. Histogram of the spectrum of adjacency matrix $A$ (left) and regularized
adjacency matrix $A'$ (right) for a sparse random graph generated from
the inhomogeneous Erdős-Rényi model with $n = 1000$ vertices, 90% of which
have expected degrees 7 and 10% have expected degrees 35.
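
An experiment of this kind can be sketched as follows. The text does not specify the edge probabilities, so the choice $p_{ij} = w_i w_j / \sum_k w_k$ with target degrees $w_i \in \{7, 35\}$ is an assumption, as are the function name `experiment`, the seed, and the reading of the rule as $\lambda_i = \min(2\bar{d}/d_i, 1)$ with $\bar{d}$ the average degree.

```python
import numpy as np

def experiment(n=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Target expected degrees: 10% of vertices near 35, 90% near 7.
    w = np.where(np.arange(n) < n // 10, 35.0, 7.0)
    P = np.outer(w, w) / w.sum()              # assumed edge probabilities
    upper = np.triu(rng.random((n, n)) < P, k=1)
    A = (upper | upper.T).astype(float)

    deg = A.sum(axis=1)
    d_bar = deg.mean()                        # overall average degree
    # Reweight as in Section 1.4, item 4, with d_bar in place of d.
    lam = np.minimum(2.0 * d_bar / np.maximum(deg, 1.0), 1.0)
    s = np.sqrt(lam)
    A_reg = A * np.outer(s, s)

    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return np.linalg.eigvalsh(A), np.linalg.eigvalsh(A_reg), d_bar

spec, spec_reg, d_bar = experiment()
print("average degree:", d_bar)
print("largest eigenvalue before and after:", spec[-1], spec_reg[-1])
```

Histograms of `spec` and `spec_reg` can then be compared. Since $0 \le A' \le A$ entrywise, the largest eigenvalue can only decrease under this reweighting.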

1.7. Application: community detection in networks. Concentration of random graphs
has an important application to statistical analysis of networks, in particular to the problem of
community detection. A common way of modeling communities in networks is the stochastic
block model [22], which is a special case of the inhomogeneous Erdős-Rényi model considered in
this paper. For the purpose of this example, we focus on the simplest version of the stochastic
block model $G(n, \frac{a}{n}, \frac{b}{n})$, also known as the balanced planted partition model, defined as
follows. The set of vertices is divided into two subsets (communities) of size $n/2$ each.
Edges between vertices are drawn independently with probability $a/n$ if they are in the same
community and with probability $b/n$ otherwise.

The community detection problem is to recover the community labels of vertices from a
single realization of the random graph model. A large literature exists on both the recovery
algorithms and the theory establishing when a recovery is possible [14, 33, 34, 32, 1, 29, 8].
There are methods that perform better than a random guess (i.e. the fraction of misclassified
vertices is bounded away from 0.5 as $n \to \infty$ with high probability) under the condition

$$(a - b)^2 > 2(a + b),$$

and no method can perform better than a random guess if this condition is violated.

Moreover, strong consistency, or exact recovery (labeling all vertices correctly with high
probability) is possible when the expected degree $(a + b)/2$ is of order $\log n$ or larger and
$a$ and $b$ are sufficiently separated, see [32, 30, 6, 20, 10]. Weak consistency (the fraction of
mislabeled vertices going to 0 with high probability) is achievable if and only if

$$(a - b)^2 > C_n (a + b) \quad \text{with } C_n \to \infty,$$

see [32]. Many of these results hold in the non-asymptotic regime, for graphs of fixed size
$n$. Thus, for any $\varepsilon > 0$ there exists $C_\varepsilon$ (which only depends on $\varepsilon$) such that one can recover
communities up to $\varepsilon n$ mislabeled vertices as long as

$$(a - b)^2 > C_\varepsilon (a + b).$$

In particular, recovery of communities is possible even for very sparse graphs - those with
bounded expected degrees. Several types of algorithms are known to succeed in this regime,
including non-backtracking walks [33, 29, 8], spectral methods [12] and methods based on
semidefinite programming [19, 31].

As an application of the new concentration results, we show that the regularized spectral
clustering [3, 23], one of the simplest and most popular algorithms for community detection, can
recover communities in the sparse regime. In general, spectral clustering works by
computing the leading eigenvectors of either the adjacency matrix or the Laplacian or their
regularized versions, and running the $k$-means clustering algorithm on these eigenvectors to
recover the node labels. In the simple case of the model $G(n, \frac{a}{n}, \frac{b}{n})$, one can simply assign
nodes to communities based on the sign (positive or negative) of the corresponding entries
of the eigenvector $v_2(\mathcal{L}(A_\tau))$ corresponding to the second smallest eigenvalue of the regularized
Laplacian matrix $\mathcal{L}(A_\tau)$ (or the regularized adjacency matrix $A'$).
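
The sign-based rule just described can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than a tuned implementation: the function names, the seed, and the parameter values `n = 600`, `a = 60`, `b = 5` are chosen here only so that the example runs quickly and the signal is strong.

```python
import numpy as np

def sample_sbm(n, a, b, rng):
    """Balanced planted partition model: n even, two communities of size n/2."""
    labels = np.repeat([1, -1], n // 2)
    same = labels[:, None] == labels[None, :]
    P = np.where(same, a / n, b / n)
    upper = np.triu(rng.random((n, n)) < P, k=1)
    return (upper | upper.T).astype(float), labels

def regularized_spectral_clustering(A):
    """Label vertices by the signs of the second eigenvector of L(A_tau)."""
    n = len(A)
    tau = A.sum() / n                       # average degree of the observed graph
    A_tau = A + tau / n                     # add tau/n to every entry
    inv_sqrt = 1.0 / np.sqrt(A_tau.sum(axis=1))
    L = np.eye(n) - inv_sqrt[:, None] * A_tau * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)             # eigenvalues in ascending order
    v2 = vecs[:, 1]                         # second smallest eigenvalue
    return np.where(v2 >= 0, 1, -1)

rng = np.random.default_rng(1)
A, labels = sample_sbm(600, a=60.0, b=5.0, rng=rng)
guess = regularized_spectral_clustering(A)

# Communities are identifiable only up to swapping the two labels.
errors = min(np.sum(guess != labels), np.sum(guess != -labels))
error_rate = errors / len(labels)
print("fraction of misclassified vertices:", error_rate)
```

The misclassification count is minimized over the two possible labelings because an eigenvector is determined only up to sign, which matches the $\min_{\beta = \pm 1}$ in the statements of this section.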

Let us briefly explain how our concentration results validate regularized spectral
clustering. If the concentration of random graphs holds and $\mathcal{L}(A_\tau)$ is close to $\mathcal{L}(\mathbb{E}A_\tau)$, then the
standard perturbation theory (Davis-Kahan theorem below) shows that $v_2(\mathcal{L}(A_\tau))$ is close
to $v_2(\mathcal{L}(\mathbb{E}A_\tau))$, and in particular, the signs of these two eigenvectors must agree on most
vertices. An easy calculation shows that the signs of $v_2(\mathcal{L}(\mathbb{E}A_\tau))$ recover the communities
exactly: this vector is a positive constant on one community and a negative constant on the
other. Therefore, the signs of $v_2(\mathcal{L}(A_\tau))$ must recover the communities up to a small fraction
of misclassified vertices.

Before stating our result, let us quote a simple version of the Davis-Kahan
perturbation theorem (see e.g. [5, Theorem VII.3.2]).

Theorem 1.3 (Davis-Kahan theorem). Let $X$, $Y$ be symmetric matrices such that the second
smallest eigenvalues of $X$ and $Y$ have multiplicity one and they are of distance at least $\delta > 0$
from the remaining eigenvalues of $X$ and $Y$. Denote by $x$ and $y$ the eigenvectors of $X$ and
$Y$ corresponding to the second smallest eigenvalues of $X$ and $Y$, respectively. Then

$$\min_{\beta = \pm 1} \|x + \beta y\|_2 \le \frac{2\|X - Y\|}{\delta}.$$

Corollary 1.4 (Community detection in sparse graphs). Let $\varepsilon > 0$ and $r \ge 1$. Let $A$ be the
adjacency matrix drawn from the stochastic block model $G(n, \frac{a}{n}, \frac{b}{n})$. Assume that

$$(a - b)^2 > C_\varepsilon (a + b) \tag{1.7}$$

where $C_\varepsilon = C r^4 \varepsilon^{-2}$ and $C$ is an appropriately large absolute constant. Choose $\tau$ to be the
average degree of the graph, i.e. $\tau = (d_1 + \cdots + d_n)/n$ where $d_i$ are vertex degrees. Then with
probability at least $1 - e^{-r}$, we have

$$\min_{\beta = \pm 1} \|v_2(\mathcal{L}(A_\tau)) + \beta v_2(\mathcal{L}(\mathbb{E}A_\tau))\|_2 \le \varepsilon.$$

In particular, the signs of the entries of $v_2(\mathcal{L}(A_\tau))$ correctly estimate the partition into the
two communities, up to at most $\varepsilon n$ misclassified vertices.

Proof. We apply Theorem 1.3 with $X = \mathcal{L}(A_\tau)$ and $Y = \mathcal{L}(\mathbb{E}A_\tau)$. A simple calculation shows
that the spectral gap $\delta$ defined in Theorem 1.3 is of the order $(a - b)/(a + b)$. The claim
of Corollary 1.4 then follows from Davis-Kahan Theorem 1.3, Concentration Theorem 4.1
(which is a formal version of Theorem 1.2) and condition (1.7). □

1.8. Organization of the paper. In Section 2, we state a formal version of Theorem 1.1. 
We show there how to deduce this result from a new decomposition of random graphs, which 
we state as Theorem 2.6. We prove this decomposition theorem in Section 3. In Section 4, we 
state and prove a formal version of Theorem 1.2 about the concentration of the Laplacian. We 
conclude the paper with Section 5 where we propose some questions for further investigation. 

Acknowledgement. The authors are grateful to Ramon van Handel for several insightful 
comments on the preliminary version of this paper. 

2. Full version of Theorem 1.1, and reduction to a graph decomposition 

In this section we state a more general and quantitative version of Theorem 1.1, and we 
reduce it to a new form of graph decomposition, which can be of interest on its own. 

Theorem 2.1 (Concentration of regularized adjacency matrices). Consider a random graph
from the inhomogeneous Erdős-Rényi model $G(n, (p_{ij}))$, and let $d = \max_{ij} n p_{ij}$. For any
$r \ge 1$, the following holds with probability at least $1 - n^{-r}$. Consider any subset consisting
of at most $10n/d$ vertices, and reduce the weights of the edges incident to those vertices in
an arbitrary way. Let $d'$ be the maximal degree of the resulting graph. Then the adjacency
matrix $A'$ of the new (weighted) graph satisfies

$$\|A' - \mathbb{E}A\| \le C r^{3/2} \big(\sqrt{d} + \sqrt{d'}\big).$$

Moreover, the same bound holds for $d'$ being the maximal $\ell_2$ norm of the rows of $A'$.

In this result and in the rest of the paper, $C, C_1, C_2, \ldots$ denote absolute constants whose
values may be different from line to line.

Remark 2.2 (Theorem 2.1 implies Theorem 1.1). The subset of $10n/d$ vertices in Theorem 2.1
can be completely arbitrary. So let us choose the high-degree vertices, say those with degrees
larger than $2d$. There are at most $10n/d$ such vertices with high probability; this follows by
an easy calculation, and also from Lemma 3.5. Thus we immediately deduce Theorem 1.1.

Remark 2.3 (Tight upper bound). If we do not reduce weights of any edges and d is bounded, 
then the upper bound in Theorem 2.1 is tight (up to a constant depending on r). This is 
because of (1.4), which states that the adjacency matrix does not concentrate in the sparse 
regime without regularization. 

Remark 2.4 (Method to prove Theorem 2.1). One may wonder if Theorem 2.1 can be proved
by developing an $\varepsilon$-net argument similar to the method of J. Kahn and E. Szemerédi [17]
and its versions [2, 15, 27, 12]. Although we cannot rule out such a possibility, we do not see
how this method could handle a general regularization. The reader familiar with the method
can easily notice an obstacle. The contribution of the so-called light couples becomes hard
to control when one changes, and even reduces, the individual entries of $A$ (the weights of
edges).

We will develop an alternative and somewhat simpler approach, which will be able to
handle a general regularization of random graphs. It sheds light on the specific structure of
graphs that enables concentration. We are going to identify this structure through a graph
decomposition in the next section. But let us pause briefly to mention the following useful
reduction.

Remark 2.5 (Reduction to directed graphs). Our arguments will be more convenient to carry
out if the adjacency matrix $A$ has all independent entries. To be able to make this assumption,
we can decompose $A$ into an upper-triangular and a lower-triangular part, both of which
have independent entries. If we can show that each of these parts concentrates about its
expectation, it would follow that $A$ concentrates about $\mathbb{E}A$ by the triangle inequality.

In other words, we may prove Theorem 2.1 for directed inhomogeneous Erdős-Rényi graphs,
where edges between any vertices and in any direction appear independently with probabilities
$p_{ij}$. In the rest of the argument, we will only work with such random directed graphs.
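
The triangle-inequality step in Remark 2.5 can be checked numerically. The sketch below is only an illustration: the homogeneous probabilities $p_{ij} = d/n$, the sizes, and the variable names are assumptions, and the spectral norm is computed with `np.linalg.norm(M, 2)`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5.0
P = np.full((n, n), d / n)        # assumed edge probabilities
np.fill_diagonal(P, 0.0)          # no self-loops

# Strictly upper-triangular part: its entries are independent.
upper = np.triu(rng.random((n, n)) < P, k=1).astype(float)
A = upper + upper.T               # symmetric adjacency matrix
EA_upper = np.triu(P, k=1)        # expectation of the upper-triangular part

dev_full = np.linalg.norm(A - P, 2)
dev_upper = np.linalg.norm(upper - EA_upper, 2)
dev_lower = np.linalg.norm(upper.T - EA_upper.T, 2)
print(dev_full, dev_upper + dev_lower)
```

Since $A - \mathbb{E}A$ is the sum of the two triangular deviations, `dev_full` never exceeds `dev_upper + dev_lower`.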

2.1. Graph decomposition. In this section, we reduce Theorem 2.1 to the following
decomposition of inhomogeneous Erdős-Rényi directed random graphs. This decomposition may
be of independent interest. Throughout the paper, we denote by $B_{\mathcal{N}}$ the matrix which
coincides with a matrix $B$ on a subset of edges $\mathcal{N} \subset [n] \times [n]$ and has zero entries elsewhere.

Theorem 2.6 (Graph decomposition). Consider a random directed graph from the
inhomogeneous Erdős-Rényi model, and let $d$ be as in (1.3). For any $r \ge 1$, the following holds with
probability at least $1 - 3n^{-r}$. One can decompose the set of edges $[n] \times [n]$ into three classes
$\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ so that the following properties are satisfied for the adjacency matrix $A$.

• The graph concentrates on $\mathcal{N}$, namely $\|(A - \mathbb{E}A)_{\mathcal{N}}\| \le C r^{3/2} \sqrt{d}$.

• Each row of $A_{\mathcal{R}}$ and each column of $A_{\mathcal{C}}$ has at most $32r$ ones.

Moreover, $\mathcal{R}$ intersects at most $n/d$ columns, and $\mathcal{C}$ intersects at most $n/d$ rows of $[n] \times [n]$.

Figure 2 illustrates a possible decomposition Theorem 2.6 can provide. The edges in $\mathcal{N}$
form a big “core” where the graph concentrates well even without regularization. The edges
in $\mathcal{R}$ and $\mathcal{C}$ can be thought of (at least heuristically) as those attached to high-degree vertices.



Figure 2. An example of graph decomposition in Theorem 2.6. 

We will prove Theorem 2.6 in Section 3; let us pause to deduce Theorem 2.1 from it. 

2.2. Deduction of Theorem 2.1. First, let us explain informally how the graph
decomposition could lead to Theorem 2.1. The regularization of the graph does not destroy the
properties of $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ in Theorem 2.6. Moreover, regularization creates a new property
for us, allowing for a good control of the columns of $\mathcal{R}$ and rows of $\mathcal{C}$. Let us focus on $A'_{\mathcal{R}}$ to
be specific. The $\ell_1$ norms of all columns of this matrix are at most $d'$, and the $\ell_1$ norms of
all rows are $O(r)$ by Theorem 2.6. By a simple calculation which we will do in Lemma 2.7,
this implies that $\|A'_{\mathcal{R}}\| = O(\sqrt{r d'})$. A similar bound can be proved for $\mathcal{C}$. Combining $\mathcal{N}$, $\mathcal{R}$
and $\mathcal{C}$ together will lead to the error bound $O(r^{3/2}(\sqrt{d} + \sqrt{d'}))$ in Theorem 2.1.

To make this argument rigorous, let us start with the simple calculation we just mentioned.

Lemma 2.7. Consider a matrix $B$ in which each row has $\ell_1$ norm at most $a$, and each
column has $\ell_1$ norm at most $b$. Then $\|B\| \le \sqrt{ab}$.

Proof. Let $x$ be a vector with $\|x\|_2 = 1$. Using the Cauchy-Schwarz inequality and the
assumptions, we have

$$\begin{aligned}
\|Bx\|_2^2 &= \sum_i \Big( \sum_j B_{ij} x_j \Big)^2 \le \sum_i \Big( \sum_j |B_{ij}| \Big) \Big( \sum_j |B_{ij}| x_j^2 \Big) \\
&\le a \sum_j x_j^2 \sum_i |B_{ij}| \le a \sum_j x_j^2 \, b = ab.
\end{aligned}$$

Since $x$ is arbitrary, this completes the proof. □
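
Lemma 2.7 is simple to sanity-check on a concrete matrix. The sketch below is an illustration only; the helper name `l1_bounds`, the matrix size, and the sparsity level are arbitrary choices made for the example.

```python
import numpy as np

def l1_bounds(B):
    """Return (a, b): the largest l1 norm of a row and of a column of B."""
    a = np.abs(B).sum(axis=1).max()
    b = np.abs(B).sum(axis=0).max()
    return a, b

rng = np.random.default_rng(0)
# A sparse rectangular test matrix with signed entries.
B = rng.standard_normal((50, 80)) * (rng.random((50, 80)) < 0.1)

a, b = l1_bounds(B)
spectral = np.linalg.norm(B, 2)   # spectral norm (largest singular value)
bound = np.sqrt(a * b)
print(spectral, bound)
```

The bound can be attained: for the $1 \times 1$ matrix $B = (3)$ we have $a = b = 3$ and $\|B\| = 3 = \sqrt{ab}$.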

Remark 2.8 (Riesz-Thorin interpolation theorem implies Lemma 2.7). Lemma 2.7 can also
be deduced from the Riesz-Thorin interpolation theorem (see e.g. [40, Theorem 2.1]), since the
maximal $\ell_1$ norm of columns is the $\ell_1 \to \ell_1$ operator norm, and the maximal $\ell_1$ norm of rows
is the $\ell_\infty \to \ell_\infty$ operator norm.

We are ready to formally deduce the main part of Theorem 2.1 from Theorem 2.6; we defer 
the “moreover” part to Section 3.6. 

Proof of Theorem 2.1 (main part). Fix a realization of the random graph that satisfies the
conclusion of Theorem 2.6, and decompose the deviation $A' - \mathbb{E}A$ as follows:

$$A' - \mathbb{E}A = (A' - \mathbb{E}A)_{\mathcal{N}} + (A' - \mathbb{E}A)_{\mathcal{R}} + (A' - \mathbb{E}A)_{\mathcal{C}}. \tag{2.1}$$

We will bound the spectral norm of each of the three terms separately. 

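Step 1. Deviation on $\mathcal{N}$. Let us further decompose

$$(A' - \mathbb{E}A)_{\mathcal{N}} = (A - \mathbb{E}A)_{\mathcal{N}} - (A - A')_{\mathcal{N}}. \tag{2.2}$$

By Theorem 2.6, $\|(A - \mathbb{E}A)_{\mathcal{N}}\| \le C r^{3/2} \sqrt{d}$. To control the second term in (2.2), denote by
$\mathcal{E} \subset [n] \times [n]$ the subset of edges that are reweighted in the regularization process. Since $A$
and $A'$ are equal on $\mathcal{E}^c$, we have

$$\begin{aligned}
\|(A - A')_{\mathcal{N}}\| = \|(A - A')_{\mathcal{N} \cap \mathcal{E}}\| &\le \|A_{\mathcal{N} \cap \mathcal{E}}\| \quad \text{(since } 0 \le A - A' \le A \text{ entrywise)} \\
&\le \|(A - \mathbb{E}A)_{\mathcal{N} \cap \mathcal{E}}\| + \|\mathbb{E}A_{\mathcal{N} \cap \mathcal{E}}\| \quad \text{(by triangle inequality).}
\end{aligned} \tag{2.3}$$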

Further, a simple restriction property implies that

$$\|(A - \mathbb{E}A)_{\mathcal{N} \cap \mathcal{E}}\| \le 2 \|(A - \mathbb{E}A)_{\mathcal{N}}\|. \tag{2.4}$$

Indeed, restricting a matrix onto a product subset of $[n] \times [n]$ can only reduce its norm.
Although the set of reweighted edges $\mathcal{E}$ is not a product subset, it can be decomposed into
two product subsets:

$$\mathcal{E} = (I \times [n]) \cup (I^c \times I) \tag{2.5}$$

where $I$ is the subset of vertices incident to the edges in $\mathcal{E}$. Then (2.4) holds; the right hand side
of that inequality is bounded by $C r^{3/2} \sqrt{d}$ by Theorem 2.6. Thus we handled the first term
in (2.3).
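
The restriction property and the factor 2 can be illustrated numerically. In the sketch below, the test matrix `B`, the random vertex subset `in_I`, and all variable names are assumptions made for the example; the matrix is restricted directly, without the additional intersection with $\mathcal{N}$ used in the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
B = rng.standard_normal((n, n))           # arbitrary test matrix
in_I = rng.random(n) < 0.2                # indicator of the vertex subset I

# The two product subsets from the decomposition of the reweighted edges.
mask_rows = in_I[:, None] & np.ones((1, n), dtype=bool)   # I x [n]
mask_rest = (~in_I)[:, None] & in_I[None, :]              # I^c x I
mask_E = mask_rows | mask_rest

# Restriction keeps the entries on the subset and zeroes out the rest.
B_rows = np.where(mask_rows, B, 0.0)
B_rest = np.where(mask_rest, B, 0.0)
B_E = np.where(mask_E, B, 0.0)

norm_B = np.linalg.norm(B, 2)
print(np.linalg.norm(B_rows, 2), np.linalg.norm(B_rest, 2),
      np.linalg.norm(B_E, 2), 2 * norm_B)
```

Each product restriction has norm at most $\|B\|$, and since the two subsets are disjoint, the restriction to their union is the sum of the two restrictions, which gives the factor 2.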

To bound the second term in (2.3), we can use another restriction property that states
that the norm of a matrix with non-negative entries can only reduce by restricting onto
any subset of $[n] \times [n]$ (whether a product subset or not). This yields

$$\|\mathbb{E}A_{\mathcal{N} \cap \mathcal{E}}\| \le \|\mathbb{E}A_{\mathcal{E}}\| \le \|\mathbb{E}A_{I \times [n]}\| + \|\mathbb{E}A_{I^c \times I}\| \tag{2.6}$$

where the second inequality follows by (2.5). By assumption, the matrix $\mathbb{E}A_{I \times [n]}$ has $|I| \le
10n/d$ rows and each of its entries is bounded by $d/n$. Hence the $\ell_1$ norm of all rows is
bounded by $d$, and the $\ell_1$ norm of all columns is bounded by 10. Lemma 2.7 implies that
$\|\mathbb{E}A_{I \times [n]}\| \le \sqrt{10d}$. A similar bound holds for the second term of (2.6). This yields

$$\|\mathbb{E}A_{\mathcal{N} \cap \mathcal{E}}\| \le 2\sqrt{10d},$$

so we handled the second term in (2.3). Recalling that the first term there is bounded by
$C r^{3/2} \sqrt{d}$, we conclude that $\|(A - A')_{\mathcal{N}}\| \le 2C r^{3/2} \sqrt{d}$.

Returning to (2.2), we recall that the first term on the right hand side is bounded by $C r^{3/2} \sqrt{d}$,
and we just bounded the second term by $2C r^{3/2} \sqrt{d}$. Hence

$$\|(A' - \mathbb{E}A)_{\mathcal{N}}\| \le 4C r^{3/2} \sqrt{d}.$$

Step 2. Deviation on $\mathcal{R}$ and $\mathcal{C}$. By the triangle inequality, we have

$$\|(A' - \mathbb{E}A)_{\mathcal{R}}\| \le \|A'_{\mathcal{R}}\| + \|\mathbb{E}A_{\mathcal{R}}\|.$$

Recall that $0 \le A'_{\mathcal{R}} \le A_{\mathcal{R}}$ entrywise. By Theorem 2.6, each of the rows of $A_{\mathcal{R}}$, and thus also
of $A'_{\mathcal{R}}$, has $\ell_1$ norm at most $32r$. Moreover, by definition of $d'$, each of the columns of $A'$,
and thus also of $A'_{\mathcal{R}}$, has $\ell_1$ norm at most $d'$. Lemma 2.7 implies that $\|A'_{\mathcal{R}}\| \le \sqrt{32 r d'}$.

The matrix $\mathbb{E}A_{\mathcal{R}}$ can be handled similarly. By Theorem 2.6, it has at most $n/d$ entries
in each row, and all entries are bounded by $d/n$. Thus each row of $\mathbb{E}A_{\mathcal{R}}$ has $\ell_1$ norm at
most 1, and each column has $\ell_1$ norm at most $d$. Lemma 2.7 implies that $\|\mathbb{E}A_{\mathcal{R}}\| \le \sqrt{d}$.
We showed that

$$\|(A' - \mathbb{E}A)_{\mathcal{R}}\| \le \sqrt{32 r d'} + \sqrt{d}.$$

A similar bound holds for $\|(A' - \mathbb{E}A)_{\mathcal{C}}\|$. Combining the bounds on the deviation of $A' - \mathbb{E}A$
on $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ and putting them into (2.1), we conclude that

$$\|A' - \mathbb{E}A\| \le 4C r^{3/2} \sqrt{d} + 2\big(\sqrt{32 r d'} + \sqrt{d}\big).$$

Simplifying this inequality, we complete the proof of the main part of Theorem 2.1. □

3. Proof of Decomposition Theorem 2.6 

3.1. Outline of the argument. We will construct the decomposition in Theorem 2.6 by
an iterative procedure. The first and crucial step is to find a big block$^1$ $\mathcal{N}' \subset [n] \times [n]$ of size
at least $(n - n/d) \times n/2$ on which $A$ concentrates, i.e.

$$\|(A - \mathbb{E}A)_{\mathcal{N}'}\| = O(\sqrt{d}).$$

To find such a block, we first establish concentration in the $\ell_\infty \to \ell_2$ norm; this can be done by
standard probabilistic techniques. Next, we can automatically upgrade this to concentration
in the spectral norm ($\ell_2 \to \ell_2$) once we pass to an appropriate block $\mathcal{N}'$. This can be
done using a general result from functional analysis, which we call Grothendieck-Pietsch
factorization.

Repeating this argument for the transpose, we obtain another block J\f" of size at least 
n/2 x (n — n/d ) where the graph concentrates as well. So the graph concentrates on A/"o := 
A f AM". The “core” No will form the first part of the class N we are constructing. 

It remains to control the graph on the complement of Nq. That set of edges is quite small; 
it can be described as a union of a block Co with n/d rows, a block IZq with n/d columns and 

An this paper, by block we mean a product set I x J with arbitrary index subsets I,JC [n]. These subsets 
are not required to be intervals of successive integers. 
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an exceptional nj 2 x n/2 block; see Figure 3b for illustration. We may consider Co and IZq 
as the first parts of the future classes C and 1Z we are constructing. 

Indeed, since $\mathcal{C}_0$ has so few rows, the expected number of ones in each column of $\mathcal{C}_0$ is bounded by $1$. For simplicity, let us think that all columns of $\mathcal{C}_0$ have $O(1)$ ones as desired. (In the formal argument, we will add the bad columns to the exceptional block.) Of course, the block $\mathcal{R}_0$ can be handled similarly.

At this point, we decomposed $[n] \times [n]$ into $\mathcal{N}_0$, $\mathcal{R}_0$, $\mathcal{C}_0$ and an exceptional $n/2 \times n/2$ block. Now we repeat the process for the exceptional block, constructing $\mathcal{N}_1$, $\mathcal{R}_1$, and $\mathcal{C}_1$ there, and so on. Figure 3c illustrates this process. At the end, we choose $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ to be the unions of the blocks $\mathcal{N}_i$, $\mathcal{R}_i$ and $\mathcal{C}_i$ respectively.


[Figure 3, panels: (a) The core. (b) (c) Final decomposition.]

Figure 3. Constructing decomposition iteratively in the proof of Theorem 2.6.


Two precautions have to be taken in this argument. First, we need to make concentration on the core blocks $\mathcal{N}_i$ better at each step, so that the sum of those error bounds does not depend on the total number of steps. This can be done with little effort, with the help of the exponential decrease of the size of the blocks $\mathcal{N}_i$. Second, we have control of the sizes but not the locations of the exceptional blocks. Thus to be able to carry out the decomposition argument inside an exceptional block, we need to make the argument valid uniformly over all blocks of that size. This will require us to be delicate with probabilistic arguments, so that we can take a union bound over such blocks.


3.2. Grothendieck-Pietsch factorization. As we mentioned in the previous section, our 
proof of Theorem 2.6 is based on Grothendieck-Pietsch factorization. This general and well 
known result in functional analysis [36, 37] has already been used in a similar probabilistic 
context, see [26, Proposition 15.11]. 

Grothendieck-Pietsch factorization compares two matrix norms, the $\ell_2 \to \ell_2$ norm (which we call the spectral norm) and the $\ell_\infty \to \ell_2$ norm. For a $k \times m$ matrix $B$, these norms are defined as

$$\|B\| = \max_{\|x\|_2 = 1} \|Bx\|_2, \qquad \|B\|_{\infty \to 2} = \max_{\|x\|_\infty = 1} \|Bx\|_2 = \max_{x \in \{-1,1\}^m} \|Bx\|_2.$$


The $\ell_\infty \to \ell_2$ norm is usually easier to control, since the supremum is taken with respect to the discrete set $\{-1,1\}^m$, and any vector there has all coordinates of the same magnitude. To compare the two norms, one can start with the obvious inequality

$$\frac{\|B\|_{\infty \to 2}}{\sqrt{m}} \le \|B\| \le \|B\|_{\infty \to 2}.$$

Both parts of this inequality are optimal, so there is an unavoidable slack between the upper and lower bounds. However, Grothendieck-Pietsch factorization allows us to tighten the inequality by changing $B$ slightly. The next two results offer two ways to change $B$: introduce weights and pass to a sub-matrix.
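For small $m$ the two norms can be compared by brute force. The sketch below, which is only an illustration and assumes NumPy, enumerates all sign vectors in $\{-1,1\}^m$ to evaluate $\|B\|_{\infty \to 2}$ and compares it with the spectral norm; the helper name `norm_inf_to_2` and the Gaussian test matrix are our own choices.

```python
import itertools
import numpy as np

def norm_inf_to_2(B):
    """Brute-force max of ||Bx||_2 over sign vectors x in {-1, 1}^m (small m only)."""
    m = B.shape[1]
    return max(np.linalg.norm(B @ np.array(x, dtype=float))
               for x in itertools.product((-1.0, 1.0), repeat=m))

# Hypothetical 5 x 8 test matrix; 2^8 = 256 sign vectors are enumerated.
rng = np.random.default_rng(1)
B = rng.standard_normal((5, 8))
m = B.shape[1]
spectral = np.linalg.norm(B, 2)
inf_to_2 = norm_inf_to_2(B)
lower = inf_to_2 / np.sqrt(m)
print(lower, spectral, inf_to_2)   # expected ordering: lower <= spectral <= inf_to_2
```

The enumeration costs $2^m$ matrix-vector products, so this is feasible only for very small $m$.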


Theorem 3.1 (Grothendieck-Pietsch's factorization, weighted version). Let $B$ be a $k \times m$ real matrix. Then there exist positive weights $\mu_j$ with $\sum_{j=1}^m \mu_j = 1$ such that

$$\|B\|_{\infty \to 2} \le \|B D_\mu^{-1/2}\| \le \sqrt{\pi/2}\, \|B\|_{\infty \to 2}, \tag{3.1}$$

where $D_\mu = \mathrm{diag}(\mu_j)$ denotes the $m \times m$ diagonal matrix with weights $\mu_j$ on the diagonal.

This result is a known combination of the Little Grothendieck Theorem (see [41, Corollary 10.10] and [38]) and Pietsch Factorization (see [41, Theorem 9.2]). In an explicit form, a version of this result can be found e.g. in [26, Proposition 15.11]. The weights $\mu_j$ can be computed algorithmically, see [42].

The following related version of Grothendieck-Pietsch's factorization can be especially useful in probabilistic contexts, see [26, Proposition 15.11]. Here and in the rest of the paper, we denote by $B_{I \times J}$ the sub-matrix of a matrix $B$ with rows indexed by a subset $I$ and columns indexed by a subset $J$.


Theorem 3.2 (Grothendieck-Pietsch factorization, sub-matrix version). Let $B$ be a $k \times m$ real matrix and $\delta > 0$. Then there exists $J \subseteq [m]$ with $|J| \ge (1 - \delta)m$ such that

$$\|B_{[k] \times J}\| \le \frac{2\|B\|_{\infty \to 2}}{\sqrt{\delta m}}.$$

Proof. Consider the weights $\mu_j$ given by Theorem 3.1, and choose $J$ to consist of the indices $j$ satisfying $\mu_j \le 1/(\delta m)$. Since $\sum_j \mu_j = 1$, the set $J$ must contain at least $(1 - \delta)m$ indices as claimed. Furthermore, the diagonal entries of $(D_\mu^{-1/2})_{J \times J}$ are all bounded from below by $\sqrt{\delta m}$, which yields

$$\|(B D_\mu^{-1/2})_{[k] \times J}\| \ge \sqrt{\delta m}\, \|B_{[k] \times J}\|.$$

On the other hand, by (3.1) the left-hand side of this inequality is bounded by $2\|B\|_{\infty \to 2}$. Rearranging the terms, we complete the proof. □
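The counting step in this proof can be sketched in a few lines. The weights below are hypothetical (computing the actual weights of Theorem 3.1 is a separate task, see [42]); the point is only that keeping the indices with $\mu_j \le 1/(\delta m)$ discards at most $\delta m$ columns and makes every retained diagonal entry of $D_\mu^{-1/2}$ at least $\sqrt{\delta m}$.

```python
import math

def good_columns(mu, delta):
    """Indices j with mu_j <= 1/(delta*m), for positive weights mu summing to 1."""
    m = len(mu)
    threshold = 1.0 / (delta * m)
    return [j for j, w in enumerate(mu) if w <= threshold]

# Hypothetical weights: one heavy column, one medium column, 18 light ones (total 1).
mu = [0.5, 0.14] + [0.02] * 18
delta = 0.25
J = good_columns(mu, delta)

# Smallest retained diagonal entry of D_mu^{-1/2}, to compare with sqrt(delta * m).
min_scale = min(1.0 / math.sqrt(mu[j]) for j in J)
print(len(J), min_scale, math.sqrt(delta * len(mu)))
```

Only the heavy column is discarded in this example, since a weight above $1/(\delta m)$ can occur for fewer than $\delta m$ indices when the weights sum to one.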


3.3. Concentration on a big block. We are starting to work toward constructing the core part $\mathcal{N}$ in Theorem 2.6. In this section we will show how to find a big block on which the adjacency matrix $A$ concentrates. First we will establish concentration in the $\ell_\infty \to \ell_2$ norm, and then, using Grothendieck-Pietsch factorization, in the spectral norm.

The lemmas of this and the next section are best understood for $m = n$, $I = J = [n]$ and $\alpha = 1$. In this case, we are working with the entire adjacency matrix, and trying to make the first step in the iterative procedure. The further steps will require us to handle smaller blocks $I \times J$; the parameter $\alpha$ will then become smaller in order to achieve better concentration for smaller blocks.

Lemma 3.3 (Concentration in $\ell_\infty \to \ell_2$ norm). Let $1 \le m \le n$ and $\alpha \ge m/n$. Then for $r \ge 1$ the following holds with probability at least $1 - n^{-r}$. Consider a block $I \times J$ of size $m \times m$. Let $I'$ be the set of indices of the rows of $A_{I \times J}$ that contain at most $\alpha d$ ones. Then

$$\|(A - \mathbb{E} A)_{I' \times J}\|_{\infty \to 2} \le C\sqrt{\alpha d m r \log(en/m)}. \tag{3.2}$$
Proof. By definition,

$$\|(A - \mathbb{E} A)_{I' \times J}\|_{\infty \to 2}^2 = \max_{x \in \{-1,1\}^m} \sum_{i \in I'} \Big( \sum_{j \in J} (A_{ij} - \mathbb{E} A_{ij}) x_j \Big)^2 = \max_{x \in \{-1,1\}^m} \sum_{i \in I} (X_i \xi_i)^2 \tag{3.3}$$

where we denoted

$$X_i := \sum_{j \in J} (A_{ij} - \mathbb{E} A_{ij}) x_j, \qquad \xi_i := \mathbf{1}_{\{\sum_{j \in J} A_{ij} \le \alpha d\}}.$$

Let us first fix a block $I \times J$ and a vector $x \in \{-1,1\}^m$. Let us analyze the independent random variables $X_i \xi_i$. Since $|X_i| \le \sum_{j \in J} |A_{ij} - \mathbb{E} A_{ij}|$, it follows by definition of $\xi_i$ that

$$|X_i \xi_i| \le \alpha d. \tag{3.4}$$

A more useful bound will follow from Bernstein's inequality. Indeed, $X_i$ is a sum of $m$ independent random variables with zero means and variances at most $d/n$. By Bernstein's inequality, for any $t > 0$ we have

$$\mathbb{P}\{|X_i \xi_i| > tm\} \le \mathbb{P}\{|X_i| > tm\} \le 2\exp\Big( \frac{-m t^2/2}{d/n + t/3} \Big), \quad t > 0. \tag{3.5}$$

For $tm \le \alpha d$, this can be further bounded by $2\exp(-m^2 t^2/4\alpha d)$, once we use the assumption $\alpha \ge m/n$. For $tm > \alpha d$, the left-hand side of (3.5) is automatically zero by (3.4). Therefore

$$\mathbb{P}\{|X_i \xi_i| > tm\} \le 2\exp\Big( -\frac{m^2 t^2}{4\alpha d} \Big), \quad t > 0. \tag{3.6}$$

We are now ready to bound the right-hand side of (3.3). By (3.6), the random variable $X_i \xi_i$ is sub-gaussian$^2$ with sub-gaussian norm at most $\sqrt{\alpha d}$. It follows that $(X_i \xi_i)^2$ is sub-exponential with sub-exponential norm at most $2\alpha d$. Using Bernstein's inequality for sub-exponential random variables (see Corollary 5.17 in [43]), we have

$$\mathbb{P}\Big\{ \sum_{i \in I} (X_i \xi_i)^2 > \varepsilon m \alpha d \Big\} \le 2\exp\big[ -c \min(\varepsilon^2, \varepsilon)\, m \big], \quad \varepsilon > 0. \tag{3.7}$$

Choosing $\varepsilon := (10/c)\, r \log(en/m)$, we bound this probability by $(en/m)^{-5rm}$.

Summarizing, we have proved that for fixed $I, J \subset [n]$ and $x \in \{-1,1\}^m$, with probability at least $1 - (en/m)^{-5rm}$, the following holds:

$$\sum_{i \in I} (X_i \xi_i)^2 \le (10/c)\, r \log(en/m) \cdot m \alpha d. \tag{3.8}$$

Taking a union bound over all possibilities of $m$, $I$, $J$, $x$ and using (3.3), (3.8), we see that the conclusion of the lemma holds with probability at least

$$1 - \sum_{m=1}^{n} 2^m \binom{n}{m}^2 \Big( \frac{en}{m} \Big)^{-5rm} \ge 1 - n^{-r}.$$

The proof is complete. □

$^2$ For definitions and basic facts about sub-gaussian random variables, see e.g. [43].

Applying Lemma 3.3 followed by Grothendieck-Pietsch factorization (Theorem 3.2), we obtain the following.
Lemma 3.4 (Concentration in spectral norm). Let $1 \le m \le n$ and $\alpha \ge m/n$. Then for $r \ge 1$ the following holds with probability at least $1 - n^{-r}$. Consider a block $I \times J$ of size $m \times m$. Let $I'$ be the set of indices of the rows of $A_{I \times J}$ that contain at most $\alpha d$ ones. Then one can find a subset $J' \subseteq J$ of at least $3m/4$ columns such that

$$\|(A - \mathbb{E} A)_{I' \times J'}\| \le C\sqrt{\alpha d r \log(en/m)}. \tag{3.9}$$
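The probability $1 - n^{-r}$ in Lemmas 3.3 and 3.4 comes from a union bound over block sizes, blocks and sign vectors. As a sanity check, and assuming that for each $m$ the union runs over $2^m$ sign vectors and $\binom{n}{m}^2$ blocks, each failing with probability at most $(en/m)^{-5rm}$, the sketch below evaluates the total failure probability numerically. The function name is illustrative, and the computation uses logarithms to avoid overflow.

```python
import math

def union_bound_failure(n, r):
    """Sum over m of 2^m * C(n,m)^2 * (e*n/m)^(-5*r*m), evaluated through logarithms."""
    total = 0.0
    for m in range(1, n + 1):
        log_term = (m * math.log(2.0)
                    + 2.0 * math.log(math.comb(n, m))
                    - 5.0 * r * m * (1.0 + math.log(n / m)))   # log(e*n/m) = 1 + log(n/m)
        total += math.exp(log_term)
    return total

# Compare the total failure probability with n^{-r} for a few hypothetical values.
for n in (10, 100, 1000):
    for r in (1, 2):
        print(n, r, union_bound_failure(n, r), n ** (-r))
```

The term with $m = 1$ dominates the sum, which is why the bound leaves a lot of room.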


3.4. Restricted degrees. The two simple lemmas of this section will help us to handle the part of the adjacency matrix outside the core block constructed in Lemma 3.4. First, we show that almost all rows have at most $O(\alpha d)$ ones, and thus are included in the core block.

Lemma 3.5 (Degrees of subgraphs). Let $1 \le m \le n$ and $\alpha \ge \sqrt{m/n}$. Then for $r \ge 1$ the following holds with probability at least $1 - n^{-r}$. Consider a block $I \times J$ of size $m \times m$. Then all but $m/\alpha d$ rows of $A_{I \times J}$ have at most $8r\alpha d$ ones.


Proof. Fix a block $I \times J$, and denote by $d_i$ the number of ones in the $i$-th row of $A_{I \times J}$. Then $\mathbb{E} d_i \le md/n$ by the assumption. Using Chernoff's inequality, we obtain

$$\mathbb{P}\{d_i > 8r\alpha d\} \le \Big( \frac{8r\alpha d}{e\, md/n} \Big)^{-8r\alpha d} \le \Big( \frac{2\alpha n}{m} \Big)^{-8r\alpha d} =: p.$$

Let $S$ be the number of rows $i$ with $d_i > 8r\alpha d$. Then $S$ is a sum of $m$ independent Bernoulli random variables with expectations at most $p$. Again, Chernoff's inequality implies

$$\mathbb{P}\{S > m/\alpha d\} \le (ep\alpha d)^{m/\alpha d} \le p^{m/2\alpha d} = \Big( \frac{2\alpha n}{m} \Big)^{-4rm}.$$

The second inequality here holds since $e\alpha d \le p^{-1/2}$. (To see this, notice that the definition of $p$ and the assumption on $\alpha$ imply that $p^{-1/2} = (2\alpha n/m)^{4r\alpha d} \ge 2^{4r\alpha d}$.)

It remains to take a union bound over all blocks $I \times J$. We obtain that the conclusion of the lemma holds with probability at least

$$1 - \sum_{m=1}^{n} \binom{n}{m}^2 \Big( \frac{2\alpha n}{m} \Big)^{-4rm} \ge 1 - n^{-r}.$$

In the last inequality we used the assumption that $\alpha \ge \sqrt{m/n}$. The proof is complete. □
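Lemma 3.5 can be illustrated by simulation in the simplest case $m = n$, $\alpha = 1$, where it says that all but $n/d$ rows of $A$ have at most $8rd$ ones. The sketch below assumes a homogeneous graph with all $p_{ij} = d/n$, which is only one instance of the model; the function name, the parameter values and the seed are our own choices.

```python
import numpy as np

def count_heavy_rows(n, d, r=1.0, seed=0):
    """Sample a symmetric G(n, d/n) adjacency matrix; count rows with more than 8*r*d ones."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random((n, n)) < d / n, k=1)   # independent edges above the diagonal
    A = (upper | upper.T).astype(int)                  # symmetrize, no self-loops
    degrees = A.sum(axis=1)
    return int((degrees > 8 * r * d).sum()), int(degrees.max())

n, d = 1000, 3.0
heavy, max_degree = count_heavy_rows(n, d)
print(heavy, max_degree, n / d)   # number of heavy rows versus the allowance n/d
```

In such a simulation the number of heavy rows is typically far below $n/d$; the lemma is stated with room to spare because it must hold uniformly over all blocks.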


Next, we handle the block of rows that do have too many ones. We show that most columns of this block have $O(1)$ ones.

Lemma 3.6 (More on degrees of subgraphs). Let $1 \le m \le n$ and $\alpha \ge \sqrt{m/n}$. Then for $r \ge 1$ the following holds with probability at least $1 - n^{-r}$. Consider a block $I \times J$ of size $k \times m$ with some $k \le m/\alpha d$. Then all but $m/4$ columns of $A_{I \times J}$ have at most $32r$ ones.

Proof. Fix $I$ and $J$, and denote by $d_j$ the number of ones in the $j$-th column of $A_{I \times J}$. Then $\mathbb{E} d_j \le kd/n \le m/\alpha n$ by assumption. Using Chernoff's inequality, we have

$$\mathbb{P}\{d_j > 32r\} \le \Big( \frac{32r}{em/\alpha n} \Big)^{-32r} \le \Big( \frac{10\alpha n}{m} \Big)^{-32r} =: p.$$

Let $S$ be the number of columns $j$ with $d_j > 32r$. Then $S$ is a sum of $m$ independent Bernoulli random variables with expectations at most $p$. Again, Chernoff's inequality implies

$$\mathbb{P}\{S > m/4\} \le (4ep)^{m/4} \le p^{m/8} = \Big( \frac{10\alpha n}{m} \Big)^{-4rm}.$$

The second inequality here holds since $4e \le p^{-1/2}$, which in turn follows by assumption on $\alpha$.

It remains to take a union bound over all blocks $I \times J$. It is enough to consider the blocks with the largest possible number of rows, thus with $k = \lfloor m/\alpha d \rfloor$. We obtain that the conclusion of the lemma holds with probability at least

$$1 - \sum_{m=1}^{n} \binom{n}{k} \binom{n}{m} \Big( \frac{10\alpha n}{m} \Big)^{-4rm} \ge 1 - n^{-r}.$$

In the last inequality we used the assumption that $\alpha \ge \sqrt{m/n}$. The proof is complete. □


3.5. Iterative decomposition: proof of Theorem 2.6. Finally, we combine the tools we developed so far, and we construct an iterative decomposition of the adjacency matrix the way we outlined in Section 3.1. Let us set up one step of this procedure, where we consider an $m \times m$ block and decompose almost all of it (everything except an $m/2 \times m/2$ block) into classes $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ satisfying the conclusion of Theorem 2.6. Once we can do this, we repeat the procedure for the $m/2 \times m/2$ block, etc.

Lemma 3.7 (Decomposition of a block). Let $1 \le m \le n$ and $\alpha \ge \sqrt{m/n}$. Then for $r \ge 1$ the following holds with probability at least $1 - 3n^{-r}$. Consider a block $I \times J$ of size $m \times m$. Then there exists an exceptional sub-block $I_1 \times J_1$ with dimensions at most $m/2 \times m/2$ such that the remaining part of the block, that is $(I \times J) \setminus (I_1 \times J_1)$, can be decomposed into three classes $\mathcal{N}$, $\mathcal{R} \subset (I \setminus I_1) \times J$ and $\mathcal{C} \subset I \times (J \setminus J_1)$ so that the following holds.

• The graph concentrates on $\mathcal{N}$, namely $\|(A - \mathbb{E} A)_{\mathcal{N}}\| \le Cr^{3/2}\sqrt{\alpha d \log(en/m)}$.

• Each row of $A_{\mathcal{R}}$ and each column of $A_{\mathcal{C}}$ has at most $32r$ ones.

Moreover, $\mathcal{R}$ intersects at most $n/\alpha d$ columns and $\mathcal{C}$ intersects at most $n/\alpha d$ rows of $I \times J$.

After a permutation of rows and columns, a decomposition of the block stated in Lemma 3.7 
can be visualized in Figure 4c. 




[Figure 4, panels: (a) (b) Repeat for transpose. (c) Final decomposition.]

Figure 4. Construction of a block decomposition in Lemma 3.7.


Proof. Since we are going to use Lemmas 3.4, 3.5 and 3.6, let us fix a realization of the random graph that satisfies the conclusions of those three lemmas.

By Lemma 3.5, all but $m/\alpha d$ rows of $A_{I \times J}$ have at most $8r\alpha d$ ones; let us denote by $I'$ the set of indices of those rows with at most $8r\alpha d$ ones. Then we can use Lemma 3.4 for the block $I' \times J$ and with $\alpha$ replaced by $8r\alpha$; the choice of $I'$ ensures that all rows have small numbers of ones, as required in that lemma. To control the rows outside $I'$, we may use Lemma 3.6 for $(I \setminus I') \times J$; as we already noted, this block has at most $m/\alpha d$ rows as required in that lemma. Intersecting the good sets of columns produced by those two lemmas, we obtain a set of at most $m/2$ exceptional columns $J_1 \subset J$ such that the following holds.

• On the block $\mathcal{N}_1 := I' \times (J \setminus J_1)$, we have $\|(A - \mathbb{E} A)_{\mathcal{N}_1}\| \le Cr^{3/2}\sqrt{\alpha d \log(en/m)}$.

• For the block $\mathcal{C} := (I \setminus I') \times (J \setminus J_1)$, all columns of $A_{\mathcal{C}}$ have at most $32r$ ones.

Figure 4a illustrates the decomposition of the block $I \times J$ into the set of exceptional columns indexed by $J_1$ and good sets $\mathcal{N}_1$ and $\mathcal{C}$.

To finish the proof, we apply the above argument to the transpose $A^{\mathsf T}$ on the block $J \times I$. To be precise, we start with the set $J' \subset J$ of all but $m/\alpha d$ small columns of $A_{I \times J}$ (those with at most $8r\alpha d$ ones); then we obtain an exceptional set $I_1 \subset I$ of at most $m/2$ rows; and finally we conclude that $A$ concentrates on the block $\mathcal{N}_2 := (I \setminus I_1) \times J'$ and has small rows on the block $\mathcal{R} := (I \setminus I_1) \times (J \setminus J')$. Figure 4b illustrates this decomposition.

It only remains to combine the decompositions for $A$ and $A^{\mathsf T}$; Figure 4c illustrates a result of the combination. Once we define $\mathcal{N} := \mathcal{N}_1 \cup \mathcal{N}_2$, it becomes clear that $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ have the required properties.$^3$ □

Proof of Theorem 2.6. Let us fix a realization of the random graph that satisfies the conclusion of Lemma 3.7. Applying that lemma for $m = n$ and with $\alpha = 1$, we decompose the set of edges $[n] \times [n]$ into three classes $\mathcal{N}_0$, $\mathcal{C}_0$ and $\mathcal{R}_0$ plus an $n/2 \times n/2$ exceptional block $I_1 \times J_1$. Apply Lemma 3.7 again, this time for the block $I_1 \times J_1$, for $m = n/2$ and with $\alpha = \sqrt{1/2}$. We decompose $I_1 \times J_1$ into $\mathcal{N}_1$, $\mathcal{C}_1$ and $\mathcal{R}_1$ plus an $n/4 \times n/4$ exceptional block $I_2 \times J_2$.

Repeat this process for $\alpha = \sqrt{m/n}$ where $m$ is the running size of the block; we halve this size at each step, and so we have $\alpha_i \le 2^{-i/2}$. Figure 3c illustrates a decomposition that we may obtain this way. In a finite number of steps (actually, in $O(\log n)$ steps) the exceptional block becomes empty, and the process terminates. At that point we have decomposed the set of edges $[n] \times [n]$ into $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$, defined as the union of $\mathcal{N}_i$, $\mathcal{R}_i$ and $\mathcal{C}_i$ respectively, which we obtained at each step. It is clear that $\mathcal{R}$ and $\mathcal{C}$ satisfy the required properties.

It remains to bound the deviation of $A$ on $\mathcal{N}$. By construction, $\mathcal{N}_i$ satisfies

$$\|(A - \mathbb{E} A)_{\mathcal{N}_i}\| \le Cr^{3/2}\sqrt{\alpha_i d \log(e/\alpha_i^2)}.$$

Thus, by triangle inequality we have

$$\|(A - \mathbb{E} A)_{\mathcal{N}}\| \le \sum_{i \ge 0} Cr^{3/2}\sqrt{\alpha_i d \log(e/\alpha_i^2)} \le C' r^{3/2}\sqrt{d}.$$

In the second inequality we used that $\alpha_i \le 2^{-i/2}$, which forces the series to converge. The proof of Theorem 2.6 is complete. □
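To see why the series converges, one can evaluate it numerically. With $\alpha_i = \sqrt{m_i/n}$ and block sizes $m_i = 2^{-i} n$, the logarithmic factor $\log(en/m_i)$ in Lemma 3.7 equals $\log(e/\alpha_i^2)$. The sketch below assumes exact halving, $\alpha_i = 2^{-i/2}$, and sums $\sqrt{\alpha_i \log(e/\alpha_i^2)}$; the partial sums stabilize, so the total is an absolute constant independent of the number of steps.

```python
import math

def core_error_series(steps):
    """Partial sum of sqrt(alpha_i * log(e / alpha_i**2)) for alpha_i = 2**(-i/2)."""
    total = 0.0
    for i in range(steps):
        alpha = 2.0 ** (-i / 2.0)
        total += math.sqrt(alpha * math.log(math.e / alpha ** 2))
    return total

# Partial sums for an increasing number of decomposition steps.
for steps in (5, 20, 150, 200):
    print(steps, core_error_series(steps))
```

The terms decay like $2^{-i/4}$ up to a factor polynomial in $i$, which is the geometric decay referred to in the proof.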

3.6. Replacing the degrees by the $\ell_2$ norms in Theorem 2.1. Let us now prove the “moreover” part of Theorem 2.1, where $d'$ is the maximal $\ell_2$ norm of the rows and columns of the regularized adjacency matrix $A'$. This is clearly a stronger statement than in the main part of the theorem. Indeed, since all entries of $A'$ are bounded in absolute value by $1$, each degree, being the $\ell_1$ norm of a row, is bounded below by the $\ell_2$ norm squared.

This strengthening is in fact easy to check. To do so, note that the definition of $d'$ was used only once in the proof of Theorem 2.1, namely in Step 2 where we bounded the norms of $A'_{\mathcal{R}}$ and $A'_{\mathcal{C}}$. Thus, to obtain the strengthening, it is enough to replace the application of Lemma 2.7 there by the following lemma.

$^3$ It may happen that an entry ends up in more than one of the classes $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$. In such cases, we split the tie arbitrarily.

Lemma 3.8. Consider a matrix $B$ with entries in $[0,1]$. Suppose each row of $B$ has at most $a$ non-zero entries, and each column has $\ell_2$ norm at most $\sqrt{b}$. Then $\|B\| \le \sqrt{ab}$.
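Before turning to the proof, the statement can be sanity-checked numerically. The sketch below assumes NumPy; it takes a random sparse matrix with entries in $[0,1]$, reads off $a$ as the largest number of non-zero entries in a row and $b$ as the largest squared $\ell_2$ norm of a column, and compares $\|B\|$ with $\sqrt{ab}$. The helper name and the test matrix are illustrative.

```python
import numpy as np

def lemma_bound(B):
    """Return (||B||, sqrt(a*b)): a = max nonzeros per row, b = max squared column l2 norm."""
    B = np.asarray(B, dtype=float)
    a = np.count_nonzero(B, axis=1).max()
    b = (B ** 2).sum(axis=0).max()
    return np.linalg.norm(B, 2), np.sqrt(a * b)

# Hypothetical sparse matrix with entries in [0, 1).
rng = np.random.default_rng(2)
mask = rng.random((50, 50)) < 0.08
B = mask * rng.random((50, 50))
spectral, bound = lemma_bound(B)
print(spectral, bound)
```

For the identity matrix both quantities equal $1$, so the inequality can hold with equality.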

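Proof. To prove the claim, let $x$ be a vector with $\|x\|_2 = 1$. Using Cauchy-Schwarz inequality and the assumptions, we have

$$\|B^{\mathsf T} x\|_2^2 = \sum_j \Big( \sum_i B_{ij} x_i \Big)^2 \le \sum_j \Big( \sum_i B_{ij}^2 \Big) \Big( \sum_{i:\, B_{ij} \ne 0} x_i^2 \Big) \le b \sum_j \sum_{i:\, B_{ij} \ne 0} x_i^2 = b \sum_i x_i^2 \sum_{j:\, B_{ij} \ne 0} 1 \le b \sum_i x_i^2\, a = ab.$$

Since $x$ is arbitrary and $\|B^{\mathsf T}\| = \|B\|$, this completes the proof. □

4. Concentration of the regularized Laplacian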


In this section, we state the following formal version of Theorem 1.2, and we deduce it from concentration of adjacency matrices (Theorem 2.1).

Theorem 4.1 (Concentration of regularized Laplacians). Consider a random graph from the inhomogeneous Erdős-Rényi model, and let $d$ be as in (1.3). Choose a number $\tau > 0$. Then, for any $r \ge 1$, with probability at least $1 - e^{-r}$ one has

$$\|\mathcal{L}(A_\tau) - \mathcal{L}(\mathbb{E} A_\tau)\| \le \frac{Cr^2}{\sqrt{\tau}} \Big( 1 + \frac{d}{\tau} \Big)^{5/2}.$$

Proof. Two sources contribute to the deviation of the Laplacian: the deviation of the adjacency matrix and the deviation of the degrees. Let us separate and bound them individually.

Step 1. Decomposing the deviation. Let us denote $\bar{A} := \mathbb{E} A$ for simplicity; then

$$E := \mathcal{L}(A_\tau) - \mathcal{L}(\bar{A}_\tau) = D_\tau^{-1/2} A_\tau D_\tau^{-1/2} - \bar{D}_\tau^{-1/2} \bar{A}_\tau \bar{D}_\tau^{-1/2}.$$

Here $D_\tau = \mathrm{diag}(d_i + \tau)$ and $\bar{D}_\tau = \mathrm{diag}(\bar{d}_i + \tau)$ are the diagonal matrices with degrees of $A_\tau$ and $\bar{A}_\tau$ on the diagonal, respectively. Using the fact that $A_\tau - \bar{A}_\tau = A - \bar{A}$, we can represent the deviation as

$$E = S + T \quad \text{where} \quad S = D_\tau^{-1/2} (A - \bar{A}) D_\tau^{-1/2}, \quad T = D_\tau^{-1/2} \bar{A}_\tau D_\tau^{-1/2} - \bar{D}_\tau^{-1/2} \bar{A}_\tau \bar{D}_\tau^{-1/2}.$$

Let us bound $S$ and $T$ separately.
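The decomposition $E = S + T$ is a purely algebraic identity and can be verified on a sample. The sketch below assumes that $A_\tau$ is obtained by adding $\tau/n$ to every entry of $A$ and that $\mathcal{L}(M) = D^{-1/2} M D^{-1/2}$, as in the display above; the homogeneous probabilities $p_{ij} = 3/n$, the value of $\tau$ and the helper `normalized` are hypothetical choices made for illustration.

```python
import numpy as np

def normalized(M, degrees):
    """D^{-1/2} M D^{-1/2} for D = diag(degrees)."""
    s = 1.0 / np.sqrt(degrees)
    return s[:, None] * M * s[None, :]

rng = np.random.default_rng(3)
n, tau = 200, 2.0
P = np.full((n, n), 3.0 / n)                  # hypothetical expectation E A, with d = 3
np.fill_diagonal(P, 0.0)
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper | upper.T).astype(float)           # one sample of the adjacency matrix

A_tau, P_tau = A + tau / n, P + tau / n       # add tau/n to every entry
deg = A.sum(axis=1) + tau                     # d_i + tau
deg_bar = P.sum(axis=1) + tau                 # expected degrees plus tau

E = normalized(A_tau, deg) - normalized(P_tau, deg_bar)
S = normalized(A - P, deg)
T = normalized(P_tau, deg) - normalized(P_tau, deg_bar)
print(np.linalg.norm(E, 2), np.linalg.norm(S, 2), np.linalg.norm(T, 2))
```

Printing the three spectral norms shows how the deviation splits between the adjacency part $S$ and the degree part $T$ for this particular sample.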

Step 2. Bounding $S$. Let us introduce a diagonal matrix $\Delta$ that should be easier to work with than $D_\tau$. Set $\Delta_{ii} = 1$ if $d_i \le 8rd$ and $\Delta_{ii} = d_i/\tau + 1$ otherwise. Then entries of $\tau\Delta$ are upper bounded by the corresponding entries of $D_\tau$, and so

$$\tau \|S\| \le \|\Delta^{-1/2} (A - \bar{A}) \Delta^{-1/2}\|.$$

Next, by triangle inequality,

$$\tau \|S\| \le \|\Delta^{-1/2} A \Delta^{-1/2} - \bar{A}\| + \|\bar{A} - \Delta^{-1/2} \bar{A} \Delta^{-1/2}\| =: R_1 + R_2. \tag{4.1}$$

In order to bound $R_1$, we use Theorem 2.1 to show that $A' := \Delta^{-1/2} A \Delta^{-1/2}$ concentrates around $\bar{A}$. This should be possible because $A'$ is obtained from $A$ by reducing the degrees that are bigger than $8rd$. To apply the “moreover” part of Theorem 2.1, let us check the magnitude of the $\ell_2$ norms of the rows $A'_i$ of $A'$:

$$\|A'_i\|_2^2 = \sum_{j=1}^n \frac{A_{ij}}{\Delta_{ii}\Delta_{jj}} \le \frac{d_i}{\Delta_{ii}} \le \max(8rd, \tau).$$

Here in the first inequality we used that $\Delta_{jj} \ge 1$ and $\sum_j A_{ij} = d_i$; the second inequality follows by definition of $\Delta_{ii}$. Applying Theorem 2.1, we obtain with probability $1 - n^{-r}$ that

$$R_1 = \|A' - \bar{A}\| \le C_1 r^2 (\sqrt{d} + \sqrt{\tau}).$$
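The effect of the matrix $\Delta$ on the rows can be seen on a small deterministic example. The following sketch builds $\Delta$ as defined in this step and checks that every row of $\Delta^{-1/2} A \Delta^{-1/2}$ has squared $\ell_2$ norm at most $\max(8rd, \tau)$. The star graph and the parameter values are hypothetical, chosen so that one vertex has degree far above $8rd$.

```python
import numpy as np

def damped_adjacency(A, d, tau, r=1.0):
    """Delta^{-1/2} A Delta^{-1/2} with Delta_ii = 1 if d_i <= 8rd, else d_i/tau + 1."""
    degrees = A.sum(axis=1)
    delta = np.where(degrees <= 8 * r * d, 1.0, degrees / tau + 1.0)
    s = 1.0 / np.sqrt(delta)
    return s[:, None] * A * s[None, :]

# Hypothetical example: a star on 100 vertices; the centre has degree 99 > 8rd = 16.
n, d, tau = 100, 2.0, 2.0
A = np.zeros((n, n))
A[0, 1:] = A[1:, 0] = 1.0
A_prime = damped_adjacency(A, d, tau)
row_norms_sq = (A_prime ** 2).sum(axis=1)
print(row_norms_sq.max(), max(8 * d, tau))
```

Here the centre row has squared norm $99/50.5$, which is below $\tau = 2$, while in $A$ itself the same row has squared norm $99$.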

To bound $R_2$, we note that by construction of $\Delta$, the matrices $\bar{A}$ and $\Delta^{-1/2} \bar{A} \Delta^{-1/2}$ coincide on the block $I \times I$, where $I$ is the set of vertices satisfying $d_i \le 8rd$. This block is very large: indeed, Lemma 3.5 implies that $|I^c| \le n/d$ with probability $1 - n^{-r}$. Outside this block, i.e. on the small blocks $I^c \times [n]$ and $[n] \times I^c$, the entries of $\bar{A} - \Delta^{-1/2} \bar{A} \Delta^{-1/2}$ are bounded by the corresponding entries of $\bar{A}$, which are all bounded by $d/n$. Thus, using Lemma 2.7, we have

$$R_2 \le \|\bar{A}_{I^c \times [n]}\| + \|\bar{A}_{[n] \times I^c}\| \le 2\sqrt{d}.$$

Substituting the bounds for $R_1$ and $R_2$ into (4.1), we conclude that

$$\|S\| \le \frac{C_2 r^2}{\tau} (\sqrt{d} + \sqrt{\tau})$$

with probability at least $1 - 2n^{-r}$.


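Step 3. Bounding $T$. Bounding the spectral norm by the Hilbert-Schmidt norm, we get

$$\|T\|^2 \le \|T\|_{HS}^2 = \sum_{i,j=1}^n T_{ij}^2, \quad \text{where} \quad T_{ij} = (\bar{A}_{ij} + \tau/n) \Big[ \frac{1}{\sqrt{\delta_{ij}}} - \frac{1}{\sqrt{\bar\delta_{ij}}} \Big]$$

and $\delta_{ij} = (d_i + \tau)(d_j + \tau)$ and $\bar\delta_{ij} = (\bar{d}_i + \tau)(\bar{d}_j + \tau)$. To bound $T_{ij}$, we note that

$$0 \le \bar{A}_{ij} + \tau/n \le \frac{d + \tau}{n} \quad \text{and} \quad \Big| \frac{1}{\sqrt{\delta_{ij}}} - \frac{1}{\sqrt{\bar\delta_{ij}}} \Big| = \frac{|\delta_{ij} - \bar\delta_{ij}|}{\sqrt{\delta_{ij}\bar\delta_{ij}}\,\big(\sqrt{\delta_{ij}} + \sqrt{\bar\delta_{ij}}\big)} \le \frac{|\delta_{ij} - \bar\delta_{ij}|}{2\tau^3}.$$

Recalling the definition of $\delta_{ij}$ and $\bar\delta_{ij}$ and adding and subtracting $(d_i + \tau)(\bar{d}_j + \tau)$, we have

$$\delta_{ij} - \bar\delta_{ij} = (d_i + \tau)(d_j - \bar{d}_j) + (\bar{d}_j + \tau)(d_i - \bar{d}_i).$$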
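Both elementary facts used here, the bound $|\delta_{ij}^{-1/2} - \bar\delta_{ij}^{-1/2}| \le |\delta_{ij} - \bar\delta_{ij}|/(2\tau^3)$ and the splitting of $\delta_{ij} - \bar\delta_{ij}$ into two terms, can be checked numerically. The sketch below draws hypothetical degrees at random; it is a sanity check under the assumption that all degrees are non-negative, not a proof.

```python
import math
import random

def check_delta_facts(di, dj, dbi, dbj, tau):
    """Check the two elementary facts used to bound T_ij for the given degrees."""
    delta = (di + tau) * (dj + tau)            # delta_ij, built from observed degrees
    delta_bar = (dbi + tau) * (dbj + tau)      # its counterpart for expected degrees
    lhs = abs(1 / math.sqrt(delta) - 1 / math.sqrt(delta_bar))
    bound = abs(delta - delta_bar) / (2 * tau ** 3)
    split = (di + tau) * (dj - dbj) + (dbj + tau) * (di - dbi)
    return lhs <= bound + 1e-12 and math.isclose(delta - delta_bar, split, abs_tol=1e-9)

random.seed(0)
ok = all(check_delta_facts(random.randint(0, 50), random.randint(0, 50),
                           50 * random.random(), 50 * random.random(), tau=1.5)
         for _ in range(1000))
print(ok)
```

The first fact uses only that $\delta_{ij}$ and $\bar\delta_{ij}$ are at least $\tau^2$, which is where non-negativity of the degrees enters.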

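So, using the inequality $(a + b)^2 \le 2(a^2 + b^2)$ and bounding $\bar{d}_j + \tau$ by $d + \tau$, we obtain

$$\|T\|^2 \le \frac{(d + \tau)^2}{2n^2\tau^6} \Big[ \sum_{i=1}^n (d_i + \tau)^2 \sum_{j=1}^n (d_j - \bar{d}_j)^2 + n(d + \tau)^2 \sum_{i=1}^n (d_i - \bar{d}_i)^2 \Big]. \tag{4.2}$$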
We claim that

$$\sum_{j=1}^n (d_j - \bar{d}_j)^2 \le C_3 r^2 nd \quad \text{with probability } 1 - e^{-2r}. \tag{4.3}$$

Indeed, since the variance of each $d_i$ is bounded by $d$, the expectation of the sum in (4.3) is bounded by $nd$. To upgrade the variance bound to an exponential deviation bound, one can use one of the several standard methods. For example, Bernstein's inequality implies that $X_j = d_j - \bar{d}_j$ satisfies $\mathbb{P}\{|X_j| > C_4 t\sqrt{d}\} \le e^{-t}$ for all $t \ge 1$. This means that the random variable $X_j^2$ belongs to the Orlicz space $L_{\psi_{1/2}}$ and has norm $\|X_j^2\|_{\psi_{1/2}} \le C_3 d$, see [26]. By triangle inequality, we conclude that $\|\sum_{j=1}^n X_j^2\|_{\psi_{1/2}} \le C_3 nd$, which in turn implies (4.3). Furthermore, (4.3) implies

$$\sum_{i=1}^n (d_i + \tau)^2 \le 2\sum_{i=1}^n (d_i - \bar{d}_i)^2 + 2\sum_{i=1}^n (\bar{d}_i + \tau)^2 \le 2C_3 r^2 nd + 2n(d + \tau)^2 \le C_5 r^2 n(d + \tau)^2.$$
Substituting this bound and (4.3) into (4.2) we conclude that

$$\|T\|^2 \le \frac{(d + \tau)^2}{2n^2\tau^6} \cdot C_3 r^2 nd \Big[ C_5 r^2 n(d + \tau)^2 + n(d + \tau)^2 \Big] \le \frac{C_6 r^4}{\tau} \Big( 1 + \frac{d}{\tau} \Big)^5.$$

It remains to substitute the bounds for $S$ and $T$ into the inequality $\|E\| \le \|S\| + \|T\|$, and simplify the expression. The resulting bound holds with probability at least $1 - n^{-r} - n^{-r} - e^{-2r} \ge 1 - e^{-r}$, as claimed. □


5. Further questions 

5.1. Optimal regularization? The main point of our paper was that regularization helps 
sparse graphs to concentrate. We have discussed several kinds of regularization in Section 1.4 
and mentioned some more in Section 1.4. We found that any meaningful regularization works, 
as long as it reduces the too high degrees and increases the too low degrees. Is there an optimal 
way to regularize a graph? Designing the best “preprocessing” of sparse graphs for spectral 
algorithms is especially interesting from the applied perspective, i.e. for real world networks. 

On the theoretical level, can regularization of sparse graphs produce the same optimal bound $2\sqrt{d}(1 + o(1))$ that we saw for dense graphs in (1.1)? Thus, an ideal regularization should bring all parasitic outliers of the spectrum into the bulk. If so, this would lead to a
potentially simple spectral clustering algorithm for community detection in networks which 
matches the theoretical lower bounds. Algorithms with optimal rates exist for this problem 
[33, 29], but their analysis is very technical. 

5.2. How exactly does concentration depend on regularization? It would be interesting to determine how exactly the concentration of the Laplacian depends on the regularization parameter $\tau$. The dependence in Theorem 4.1 is not optimal, and we have not made efforts to improve it. Although it is natural to choose $\tau \sim d$ as in Theorem 1.2, choosing $\tau \gg d$ could also be useful [23]. Choosing $\tau \ll d$ may be interesting as well, for then $\mathcal{L}(\mathbb{E} A_\tau) \approx \mathcal{L}(\mathbb{E} A)$ and we obtain the concentration of $\mathcal{L}(A_\tau)$ around the Laplacian of the expectation of the original (rather than regularized) matrix $\mathbb{E} A$.

5.3. Average expected degree? Both concentration results of this paper, Theorems 1.1 and 1.2, depend on $d = \max_{ij} n p_{ij}$. Would it be possible to reduce $d$ to the maximal expected
degree d ave = max, 

5.4. From random graphs to random matrices? Adjacency matrices of random graphs 
are particular examples of random matrices. Does the phenomenon we described, namely 
that regularization leads to concentration, apply for general random matrices? Guided by 
Theorem 1.1, we might expect the following for a broader general class of random matrices 
B with mean zero independent entries. First, the only reason the spectral norm of B is too 
large (and that it is determined by outliers of spectrum) could be the existence of a large row 
or column. Furthermore, it might be possible to reduce the norm of B (and thus bring the 
outliers into the bulk of spectrum) by regularizing in some way the rows and columns that 
are too large. For related questions in random matrix theory, see the recent work [4, 21]. 